Another Look at the New York City School Voucher Experiment 



ABSTRACT 

This paper reexamines data from the New York City school choice program, the largest 
and best implemented private school scholarship experiment yet conducted. In the 
experiment, low-income public school students in grades K-4 were eligible to participate 
in a series of lotteries for a private school scholarship in May 1997. Data were collected 
from students and their parents at baseline, and in the Spring of each of the next three 
years. Students with missing baseline test scores, a group that encompasses all those who were 
initially in Kindergarten and 11 percent of those initially in grades 1-4, were excluded 
from previous analyses of achievement, even though these students were tested in the 
follow-up years. In principle, random assignment would be expected to lead treatment 
status to be uncorrelated with all baseline characteristics. Including students with 
missing baseline test scores increases the sample size by 44 percent. For African 
American students, the only group to show a significant, positive effect of vouchers on 
achievement in past studies, the difference in average follow-up test scores between the 
treatment group (those offered a voucher) and control group (those not offered a voucher) 
becomes statistically insignificant at the .05 level and much smaller if the full sample is 
used. In addition, the effect of vouchers is found to be sensitive to the particular way 
race/ethnicity was defined. Previously, race was assigned according to the racial/ethnic 
category of the child's mother. If children with a Black (non-Hispanic) father are added 
to the sample of children with a Black (non-Hispanic) mother, the effect of vouchers is 
smaller and statistically insignificant at conventional levels. 

Now that the Supreme Court has ruled in the Zelman case that public funds may 
be used to support vouchers to enroll children in private religious schools, many states, 
school districts and parents will seriously consider the desirability of school vouchers. 
This decision naturally depends on many factors, not least of which is whether school 
vouchers are likely to raise student achievement. The best currently available evidence 
on the effect of school vouchers on students’ performance is from a series of three 
randomized experiments conducted in Washington, D.C., Dayton, OH and New York 
City by Paul Peterson and his collaborators. This paper reexamines evidence from the 
New York City voucher experiment, which was conducted by Mathematica Policy 
Research and the Program on Education Policy and Governance at Harvard University. 

The New York City experiment was selected because it is the only one of the 
three experiments for which data have been made available to outside researchers. Two 
additional reasons argue for a detailed evaluation of the New York experiment, however. 
First, the New York experiment is the best documented of the three experiments, and had 
the lowest attrition rate, highest voucher take-up rate, and largest sample size. Second, 
New York is the only one of the three cities to show significant gains in test scores for 
voucher recipients relative to non-recipients for African American students at the 
conclusion of the experiment. 1 In all three experiments, there is no significant difference 
in student performance between those offered a voucher and the control group for other 
racial and ethnic groups, or overall. 



1 See Howell and Peterson (2002), Table 6-3. In Washington, vouchers had a statistically insignificant, 
negative effect on Black students’ scores after three years; in Dayton, vouchers had a positive effect that 
was statistically insignificant at the 0.10 level holding constant family background controls (but significant 
at the 0.10 level without family background controls) in the second and final year of the experiment. 

The New York City school choice experiment worked as follows. 2 In February 
1997, the School Choice Scholarships Foundation (SCSF), a private foundation, offered 
1,300 scholarships worth up to $1,400 a year for three years to children from low-income 
families (i.e., qualified for free lunch) who were enrolled in Kindergarten through fourth 
grade in New York City public schools. Some 11,105 eligible students applied for a 
scholarship between February and late April 1997. Recipients were selected in a series of 
lotteries in May 1997, and began attending private schools the next fall. Mathematica 
randomly selected the students offered vouchers (subject to the SCSF requirement that 85 
percent of recipients be from public schools in the bottom half of the city-wide test score 
distribution) and a control group from the eligible applicants. About three quarters of the 
students who were offered vouchers used them in at least one year; these students 
overwhelmingly attended religious schools. 3 Information from the students and their 
parents was collected prior to the lottery and in the spring of each of the ensuing three 
years. Base weights were constructed so the students in the sample were representative 
of the pool of eligible applicants (which had 70 percent of students from schools with 
below-median scores), and the weights were subsequently adjusted for attrition each year. 

Students were given the Iowa Test of Basic Skills (ITBS) at baseline and in the 
spring of each of the three follow-up years. A decision was made not to test the cohort of 
Kindergarten students applying for scholarships for first grade at baseline, however. 
(Henceforth, the five cohorts of students in different grades will be referred to by their 
grade level at baseline.) The Kindergarten cohort was nonetheless given follow-up ITBS 
tests, along with other students, when they were in grades 1, 2 and 3. In addition, about 

2 See Mayer, Peterson and Myers, et al. (2002) and Hill, Rubin and Thomas (2000) for further details. 

3 Very few of the controls attended private school. 

11 percent of students initially in grades 1-4 lacked baseline scores. 4 These students were 
also given the ITBS in the three follow-up waves. The sample weights do not attempt to 
adjust for missing baseline scores. 

A contribution of our paper is that we include students with missing baseline 
scores in much of our analysis. Previous analyses of achievement omitted students with 
missing baseline test data. 5 Howell and Peterson (2002), for example, report, “A handful 
of additional families were offered vouchers, but they were not included in the evaluation 
for lack of baseline [test] information.” Because of random assignment, however, 
estimates are unbiased even without conditioning on baseline information, so there is an 
efficiency loss from excluding these students entirely. For the subsample with baseline 
scores, omitting the baseline score only trivially affects the estimated treatment effect, as 
one would expect with random assignment. Including students with missing baseline test 
data increases the sample size by 44 percent in the third and final follow-up year; nearly 
30% of those with missing baseline scores were in grades 1-4 when the experiment 
started and should have had baseline scores. An argument can be made that including 
students with missing baseline scores - both those in the Kindergarten cohort and those in 
the other cohorts - is desirable because the weights make no provision for sample 
exclusion due to missing baseline scores, and, more importantly, because using a sample 
that encompasses more grade levels enhances the generalizability of the results. 



4 This is in addition to students who had scores of 0 on the baseline test - of the 1,851 students with 
baseline scores, 199 (10.7%) received a zero score in reading and 324 (17.5%) received a zero score in 
math; 97 (5.2%) received a zero on both exams. 

5 When Mayer, Peterson and Myers, et al. (2002) examine outcomes such as parental satisfaction, however, 
they include observations on students without baseline test data. It is unclear whether Howell and Peterson 
(2002) include or exclude students with missing baseline tests when they study parental responses. Their 
Table 2-3 reports that they include students entering grades 1-4, but that apparently is a misprint, as their 
sample includes students entering grades 2-5. Nevertheless, we could only replicate some of their results 
based on the parental survey if we include the Kindergarten cohort. 

For African American students, the estimated effect of being offered a voucher is 
much weaker if students with missing baseline scores are included in the analysis. In the 
third follow-up year, for example, the estimated effect of being offered a voucher on 
composite test scores is 2.78 percentile points (t = 1.65) if baseline test scores are 
dropped from the model and the larger sample is used; controlling for baseline covariates 
other than test scores, the effect is 2.12 points with a t-ratio of 1.27. 6 For comparison, 
Mayer, Peterson and Myers report a coefficient of 5.5 points with a t-ratio of 3.42. 

Results are also weaker in the first two follow-up years if students with missing baseline 
scores are included in the sample. These findings raise doubts about the robustness of 
earlier findings of a significant positive effect of offering vouchers on test scores for 
African American students. 
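
The comparison above can be made concrete with a little arithmetic: a coefficient and its t-ratio together imply a standard error (the coefficient divided by the t-ratio). The sketch below backs out the implied standard errors for the three third-year estimates quoted above and checks each against the .05 threshold. It assumes a two-sided test against a standard normal reference distribution, which is an illustrative simplification, and the dictionary labels and function names are hypothetical.

```python
from statistics import NormalDist

# (coefficient in percentile points, t-ratio) as quoted in the text above.
# The labels are illustrative names only.
ESTIMATES = {
    "full sample, no baseline covariates": (2.78, 1.65),
    "full sample, non-test baseline covariates": (2.12, 1.27),
    "Mayer, Peterson and Myers": (5.5, 3.42),
}


def implied_se(coef, t_ratio):
    """Standard error implied by a coefficient and its t-ratio."""
    return coef / t_ratio


def two_sided_p(t_ratio):
    """Two-sided p-value. ASSUMPTION: standard normal approximation."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(t_ratio)))


def summarize(estimates=ESTIMATES, alpha=0.05):
    """Return implied SE, p-value and significance flag for each estimate."""
    rows = {}
    for label, (coef, t_ratio) in estimates.items():
        p_value = two_sided_p(t_ratio)
        rows[label] = {
            "se": implied_se(coef, t_ratio),
            "p": p_value,
            "significant": p_value < alpha,
        }
    return rows


if __name__ == "__main__":
    for label, row in summarize().items():
        print(f"{label}: se={row['se']:.2f}, p={row['p']:.3f}, "
              f"significant at .05: {row['significant']}")
```

On this arithmetic the implied standard errors are all roughly 1.6 to 1.7 points, so the difference in significance reflects the smaller point estimates in the larger sample.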

In the next section, we discuss and evaluate the random assignment procedures in 
more detail. In the following section we explore the sensitivity of the results to including 
and excluding students without baseline scores, and examine differences in the treatment 
effect across cohorts. 

Although results for the larger sample encompassing students with and without 
baseline scores cast doubt on the inference that vouchers had a positive impact on Black 
students’ test scores, we explore reasons why vouchers might have been more effective at 
raising scores for Black students than for other students in the grade 1-4 cohorts. Results 
presented in Section 4 suggest that differential characteristics of the initial public schools 
Black and non-Black students attended are not responsible for any differential effect by 
race, even in the sample analyzed by Howell and Peterson. 

6 These estimates control for 30 dummies indicating the original strata students were placed in for random 
assignment. The definition of the strata is described in detail below. 

Indeed, the data suggest that race itself independently affects the gain from 
vouchers. This leads us to closely examine the particular definition of race used in the 
experiment. There is no universally accepted definition of race. Mathematica assigned 
students to a racial/ethnic group based on a single question on the parental survey that 
asked respondents to select only one of the following categories for the mother or female 
guardian: Black/African American (non-Hispanic), White (non-Hispanic), Puerto Rican, 
Dominican, Other Hispanic, etc. Students were assigned the mother’s race/ethnicity, 
regardless of the father’s race or ethnicity. Thus, if a student is reported as having a 
Black mother and a Hispanic father, he or she was classified as Black. If the same 
student had a Black father and Hispanic mother, however, he or she would have been 
classified as Hispanic. Arguably, father’s race is also relevant. If we augment the sample 
to include students for whom either parent is classified as Black/African American (non- 
Hispanic), the effect of offering a voucher on the composite score in year three falls to 
1.52 points with a t-ratio of 1.04. So in the broader sample the estimated effect of 
offering vouchers to students with a Black parent is small and statistically insignificant. 

Throughout the paper, we focus mainly on intent-to-treat estimates; that is, the 
impact of offering students a voucher on their test performance, as opposed to the effect 
of attending private school on test performance. We focus on intent-to-treat estimates 
because offering a voucher - as opposed to compelling students to switch to private 
school - is the policy decision that is most relevant, and because there is a cleaner 
statistical interpretation of the intent-to-treat estimates in this case. Nevertheless, the 
effect of attending a private school for varying lengths of time is also of interest. In 
Section 5 we present Instrumental Variables results that estimate the impact of the 
number of years in private school on student achievement. These results differ from 
those emphasized by Howell and Peterson (2002), who examine the effect of attending 
private school for three full years, and implicitly make strong assumptions about the 
effect of switching to private school on achievement for those who attend private school 
for fewer than three years. 
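
The link between the intent-to-treat and Instrumental Variables estimates can be written compactly in the simplest case. As a sketch, assume a binary voucher offer Z, years in private school D, a test score Y, and no covariates. The notation is introduced here only for illustration, and this is the textbook Wald form rather than necessarily the exact specification estimated in Section 5:

```latex
\[
\hat{\beta}_{IV}
  = \frac{\bar{Y}_{Z=1} - \bar{Y}_{Z=0}}{\bar{D}_{Z=1} - \bar{D}_{Z=0}}
  = \frac{\text{intent-to-treat effect on test scores}}
         {\text{effect of the offer on years in private school}}
\]
```

The numerator is the intent-to-treat estimate; dividing by the difference in average years of private school attendance rescales it to an effect per year attended.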

1. Randomization 

The procedures used to randomly assign students to treatment and control status, 
and select control group members for follow up, are described in Hill, Rubin and Thomas 
(2000). Because multiple children from many families applied for scholarships, and it 
was desired to assign all family members to the same treatment status, students were 
assigned to control and treatment groups in a lottery in which families were the unit of 
observation. Two methods of random assignment were used. 

Briefly, for students from 1,000 families a Propensity Matched Pairs Design 
(PMPD) was used, and for students from 960 families a Stratified Block design was used. 
The PMPD method, which introduces considerable complexity to the design, was used 
because in the first lottery many more potential control group members were available to 
be followed up than was money to follow them up. Rather than select a random sample 
of the controls to follow up, it was decided that it would be more efficient to follow up 
the subset that, in some sense, is most similar to treatment group members. Consequently, 
after the treatment group was randomly selected, the members of the control group who 
were followed up were selected by estimating a propensity score model to identify those 
with attributes that were closest to members of the treatment group. 

According to Hill, Rubin and Thomas (2000), variables used in this model 
included, in order of importance: family size, a dummy indicating above versus below 
median schools, grade level, and initial test scores. 7 Students with missing data were also 
included in the selection. Once the propensity score model was estimated, a matched pair 
for each treatment group family was selected by choosing the nearest available neighbor 
Mahalanobis match from among those with propensity scores close to that of each 
treatment group member. 
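
The matching step described above can be sketched as follows. This is a minimal illustration, not the procedure Hill, Rubin and Thomas (2000) actually ran: it assumes propensity scores have already been estimated, and the function name `match_pairs`, the caliper width, and the greedy processing order are all hypothetical choices made for the example.

```python
import numpy as np


def match_pairs(x_treat, x_ctrl, ps_treat, ps_ctrl, caliper=0.1):
    """Greedy nearest-available Mahalanobis matching within a propensity caliper.

    x_treat, x_ctrl : 2-D covariate arrays (rows are families).
    ps_treat, ps_ctrl : 1-D arrays of propensity scores, assumed pre-estimated.
    caliper : ASSUMED maximum propensity-score gap for a control to count as
        "close"; the width used in the actual study is not stated in the text.
    Returns a dict mapping treated row index -> matched control row index.
    """
    x_treat = np.asarray(x_treat, dtype=float)
    x_ctrl = np.asarray(x_ctrl, dtype=float)
    ps_treat = np.asarray(ps_treat, dtype=float)
    ps_ctrl = np.asarray(ps_ctrl, dtype=float)

    # Pooled covariance of the covariates defines the Mahalanobis metric.
    pooled = np.vstack([x_treat, x_ctrl])
    cov = np.atleast_2d(np.cov(pooled, rowvar=False))
    cov_inv = np.linalg.pinv(cov)

    available = np.ones(len(x_ctrl), dtype=bool)
    pairs = {}
    for i in range(len(x_treat)):
        # Candidate controls: not yet used and within the propensity caliper.
        close = available & (np.abs(ps_ctrl - ps_treat[i]) <= caliper)
        if not close.any():
            continue  # leave this treated family unmatched
        idx = np.flatnonzero(close)
        diff = x_ctrl[idx] - x_treat[i]
        # Squared Mahalanobis distance for each candidate.
        d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
        j = int(idx[np.argmin(d2)])
        pairs[i] = j
        available[j] = False  # match without replacement
    return pairs
```

Matching without replacement means the result depends on the order in which treated families are processed; a production design would need to fix that order deliberately.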

The Stratified Block design was much simpler. Samples of screened applicants 
were invited to participate in four sessions at which baseline data were collected and 
ITBS tests were administered. In these lotteries, by design approximately 85 percent of 
the invitees were from schools with test scores below the city median. Treatments and 
controls were randomly assigned in four lotteries, and a random sample of participants 
were included in the follow-up sample. Each of these lotteries constitutes a block. 

One can define 30 mutually exclusive “random assignment strata”: 5 lottery 
blocks (PMPD block plus 4 stratified blocks) x 2 school types (above and below city 
median test score) x 3 family size groups (1, 2 or 3 or more students). Assignment is random 
within these strata. After the lotteries were held, Mathematica discovered that some 
families had misreported their family size and were thus placed in the “wrong” strata; 
revised strata were created with the latest family size information. Nevertheless, 
assignment to treatment status is random within the original strata that were actually used 
to apportion the sample at the time of random assignment. Howell and Peterson (2002) 

7 Other variables included, in order of importance, were: ethnicity, mother’s education, participation in 
special education, participation in a gifted and talented program, language spoken at home, welfare receipt, 
food stamp receipt, mother’s employment status, educational expectations, number of siblings, and an 
indicator for whether the mother was foreign born. 



and Mayer, Peterson and Myers, et al. (2002), however, controlled for dummies 
indicating membership in the revised strata, not the actual strata used to make 
assignments. Unless otherwise specified, our results condition on the actual strata that 
were used when random assignments were made. Fortunately, as shown below, the 
choice of original or revised strata has relatively little impact on the results. 
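
The count of 30 random assignment strata follows directly from crossing the three dimensions described above. The sketch below enumerates them; the block labels (`PMPD`, `SB1` to `SB4`), the school-type labels, and the function names are hypothetical, chosen only for illustration.

```python
from itertools import product

# Hypothetical labels for the three stratifying dimensions.
BLOCKS = ("PMPD", "SB1", "SB2", "SB3", "SB4")      # 1 PMPD block + 4 stratified blocks
SCHOOL_TYPES = ("below_median", "above_median")    # relative to city median test score
FAMILY_SIZES = ("1", "2", "3+")                    # number of students in the family

# Map each (block, school type, family size) combination to a stratum index.
STRATA = {
    combo: index
    for index, combo in enumerate(product(BLOCKS, SCHOOL_TYPES, FAMILY_SIZES))
}


def family_size_group(n_students):
    """Collapse a count of students into the three family size groups."""
    if n_students < 1:
        raise ValueError("a family must have at least one student")
    return "3+" if n_students >= 3 else str(n_students)


def stratum_index(block, school_type, n_students):
    """Look up the stratum for one family."""
    return STRATA[(block, school_type, family_size_group(n_students))]
```

Because family size is one of the three components, a family that misreported its size lands in a different stratum once the size is corrected, which is how the original and revised strata can differ for the same family.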

Because assignments were over families, not students, and children from the same 
family tend to have correlated outcomes, we compute bootstrap standard errors that use 
families instead of individual students as the unit for resampling to allow for dependence 
across family members. 8 (Forty-six percent of students in the sample had at least one 
sibling in the sample as well.) Moulton (1990) provides a nice illustration of inference 
problems that can arise from ignoring correlated errors. 
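
Resampling families rather than students can be sketched as follows. This is a simplified illustration under stated assumptions: it bootstraps an unweighted difference in mean scores between treatment and control students, whereas the estimates in this paper come from weighted regressions that condition on strata. The function name and arguments are hypothetical.

```python
import random
import statistics
from collections import defaultdict


def family_bootstrap_se(scores, treated, families, n_boot=500, seed=0):
    """Bootstrap SE of the treatment-control mean difference, resampling families.

    scores, treated, families : parallel sequences, one entry per student.
    Whole families are drawn with replacement, so siblings always enter or
    leave a bootstrap sample together.
    """
    rng = random.Random(seed)

    # Group student records by family.
    by_family = defaultdict(list)
    for score, is_treated, family in zip(scores, treated, families):
        by_family[family].append((score, bool(is_treated)))
    family_ids = list(by_family)

    estimates = []
    for _ in range(n_boot):
        drawn = rng.choices(family_ids, k=len(family_ids))
        sample = [record for family in drawn for record in by_family[family]]
        treat_scores = [s for s, t in sample if t]
        ctrl_scores = [s for s, t in sample if not t]
        if not treat_scores or not ctrl_scores:
            continue  # skip degenerate draws with an empty group
        estimates.append(
            statistics.fmean(treat_scores) - statistics.fmean(ctrl_scores)
        )
    return statistics.stdev(estimates)
```

Drawing students independently instead would treat siblings as separate observations and would tend to understate the standard error when their outcomes are positively correlated.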

Table 1 reports the mean of several baseline characteristics for the treatment and 
control groups for the full sample, separately for Black and Latino students, and 
disaggregated by cohort for Black students. Because random assignment was 
implemented within strata, regressions were estimated to condition on the 30 original 
randomization strata, and conditional treatment-control differences and t-tests are 
reported as well. For the overall sample, the results indicate small and statistically 
insignificant differences between the treatment and control groups. For example, the 
conditional t-test for the difference in the mean baseline composite test score between 
treatments and controls is only -0.63. Mayer, Peterson and Myers, et al. (2002) present 
similar results. Hill, Rubin and Thomas (2000) report treatment-control differences by 
PMPD versus Stratified Block strata, and find slightly better balance in the PMPD strata, 
but in both cases there are no systematic differences between the treatments and controls. 
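
A conditional treatment-control difference of the kind reported in Table 1 can be sketched as the coefficient on treatment status in a regression that includes a dummy for each stratum. The example below computes that coefficient by demeaning within strata, which gives the same point estimate as the dummy-variable regression. It is unweighted and returns only the point estimate, not the t-test, and the function name is hypothetical.

```python
from collections import defaultdict


def conditional_difference(outcome, treated, stratum):
    """Treatment coefficient from an OLS regression of outcome on treatment
    status plus a full set of stratum dummies, via within-stratum demeaning.

    outcome, treated, stratum : parallel sequences, one entry per student.
    ASSUMPTION: no sample weights are applied.
    """
    groups = defaultdict(list)
    for y, t, s in zip(outcome, treated, stratum):
        groups[s].append((float(y), float(t)))

    numerator = 0.0
    denominator = 0.0
    for rows in groups.values():
        y_bar = sum(y for y, _ in rows) / len(rows)
        t_bar = sum(t for _, t in rows) / len(rows)
        for y, t in rows:
            numerator += (t - t_bar) * (y - y_bar)
            denominator += (t - t_bar) ** 2
    if denominator == 0.0:
        raise ValueError("treatment status does not vary within any stratum")
    return numerator / denominator
```

For instance, if treatment shares differ across strata and mean baseline scores also differ across strata, the raw difference in means can be nonzero even when there is no difference within any stratum; the conditional estimate removes that between-stratum component.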